Abstract


We explore methods for content selection and address the issue of
coherence in the context of the generation of multimedia artifacts.
We use audio and video to present two case studies: generation of
film tributes, and lecture-driven science talks. For content
selection, we use centrality-based and diversity-based summarization,
along with topic analysis. To establish coherence, we use the
emotional content of music, for film tributes, and ensure topic
similarity between lectures and documentaries, for science talks.
Composition techniques for the production of multimedia artifacts
are addressed as a means of organizing content, in order to improve
coherence. We discuss our results considering the above aspects.


1 Introduction 


We focus on the automatic generation of video artifacts, with content
selection and coherence as our main concerns. Automatic video
generation has been explored in a wide variety of areas, including
the generation of film trailers (Brachmann et al., 2007), conference
video proceedings (Amir et al., 2004), sports (Mendi et al., 2013),
music (Hua et al., 2004), and matter-of-opinion documentaries
(Bocconi et al., 2008).


In our work, specifically, as usual in text-to-text generation, the
creation process is guided by an original document, the input. In
that sense, the sequence of segments that composes a video artifact
should correspond to parts that make up a narrative following some
intent (Branigan, 1992). Content selection is driven by the text
stream corresponding to the subtitles of the input document.
Coherence is addressed by considering the audio and video streams.
To showcase our approach, we generate multimedia artifacts for two
distinctly purposed case studies: film tributes, and lecture-driven
science-talks.


For films, lectures, and documentaries, we use subtitles on account
of their availability. We use timestamps to map text to the
audio/video stream. All subtitles were segmented at the sentence
level. Timestamps and punctuation inside sentences were removed.
Specifically for films, we decided to use subtitles for the
hearing-impaired due to their resemblance to scripts, which have been
shown to produce better results in summarization tasks (Aparicio et
al., 2015).
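The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: it assumes the subtitles are in SubRip (.srt) form, which the text does not specify, and the names `parse_srt`, `segment_sentences`, and `Sentence` are hypothetical. Cues are merged until one ends in terminal punctuation, and each sentence keeps the time span that covers it, so it can later be mapped back to the audio/video stream.

```python
import re
from dataclasses import dataclass

# Assumption: SubRip-style timestamps such as 00:01:02,500.
TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


@dataclass
class Sentence:
    text: str
    start: float  # seconds from the beginning of the video
    end: float


def to_seconds(stamp):
    hours, minutes, seconds, millis = map(int, TIME.match(stamp).groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def parse_srt(srt_text):
    """Return a list of (start, end, text) tuples, one per subtitle cue."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed cues
        start, end = (to_seconds(p.strip()) for p in lines[1].split("-->"))
        cues.append((start, end, " ".join(lines[2:])))
    return cues


def segment_sentences(cues):
    """Merge consecutive cues until one ends with terminal punctuation."""
    sentences, buffer, start = [], [], None
    for cue_start, cue_end, text in cues:
        if start is None:
            start = cue_start
        buffer.append(text)
        if re.search(r"[.!?]\W*$", text.strip()):
            sentences.append(Sentence(" ".join(buffer), start, cue_end))
            buffer, start = [], None
    if buffer:  # trailing text without terminal punctuation
        sentences.append(Sentence(" ".join(buffer), start, cues[-1][1]))
    return sentences
```

A cue that contains several sentences is kept as a single unit here; splitting inside a cue would require estimating timestamps that the subtitles do not provide.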


Our first case study uses a film and a song as input. The intended
output artifact consists of a video containing important parts of the
film along with the specified song. Music is known to have a profound
effect on humans’ emotions (Picard, 1997). For this reason, it is
often used along with stories in order to emphasize their emotive
content. In the same way, we address coherence by synchronizing the
emotive content of the song with relevant, emotionally-related parts
of the film. In order to maintain some consistency concerning the
emotions portrayed by the selected content, centrality-based
summarization approaches may be appropriate. We use Support Sets
(Ribeiro and de Matos, 2011) as a ranking algorithm to determine the
most central (important) sentences from the film, and let the
emotions of the input song govern which content is presented.

Our second case study concerns physics video lectures. For this
artifact, we use scientific documentaries to illustrate the main
subjects addressed in the input lecture. In contrast with our first
case study, we produce a video from objective and informative
sources. Lectures are talks structured around a syllabus that present
information concerning a specific issue with a set of topics.
Specifically, physics courses study real-world phenomena that can be
exemplified in various ways. We focus on obtaining a diverse
representation of the lecture comprising different topics. For this
reason, we summarize its subtitles using GRASSHOPPER (Zhu et al.,
2007), which maximizes diversity while penalizing redundant content.
We obtain a thematically-coherent artifact by selecting topic-related
content from a collection of documentaries.

This article is organized as follows: Section 2 presents related work
for video generation; Section 3 presents our first case study, film
tributes; Section 4 presents our second case study, lecture-driven
science-talks; Section 5 presents a discussion of our early results;
Section 6 presents the conclusions and directions for future
research.


2 Related Work 

In the following sections, we present work concerning content
selection and coherence aspects for automatic generation of
multimedia artifacts.

2.1 Content Selection 

It is essential to extract important content in order to generate a
video that raises the viewer’s interest. For example, Ma et al.
(2002) present a method that models the user’s attention in order to
create video summaries. However, despite the numerous techniques to
produce video skims (Truong and Venkatesh, 2007), results are still
far from human expectations. Many approaches neglect the audio, due
to the difficulty of integrating its features in video (Li and
Merialdo, 2010). Other approaches propose the fusion of text, audio,
and visual features, in relation to particular topics, for multimedia
summarization (Ding et al., 2012). Given the availability of
subtitles for films, lectures, and documentaries, text can be
exploited to determine content. Evangelopoulos et al. (2013)
summarize films using their subtitles (text), along with information
from the audio and the visual streams, integrating cues from these
sources in a multimodal saliency curve. Auditory saliency is
determined by cues that compute multifrequency waveform modulations,
visual saliency is calculated using intensity, color, and orientation
values, and textual saliency is obtained through Part-Of-Speech (POS)
tagging. Several generic summarization algorithms have been developed
to determine relevant content, for instance methods based on:
centrality (Erkan and Radev, 2004; Ribeiro and de Matos, 2011);
diversity (Carbonell and Goldstein, 1998; Zhu et al., 2007); or
uncovering latent structure (Gong and Liu, 2001).


2.2 Coherence 

Extractive summaries are known to lack coherence (Paice, 1990). Foltz
et al. (1998) proposed an automatic method that addresses coherence
as semantic relations between adjacent sentences. Latent Semantic
Analysis (LSA) (Landauer et al., 1998) is used to uncover the latent
structure of the input; sentences are then represented as vectors
built from the vectors of the words they contain. Computing the
similarity between sentences determines aspects of local coherence.
Specifically, for video generation, coherence can be established by
means of composition techniques for video production, based on
temporal constraints, along with thematic and structural continuity
(Ahanger and Little, 1998). Additionally, music data can be explored
as a mechanism to provide coherence. Wu et al. (2012) produced a
music video composed of web images and a song. The song lyrics were
used to search the web for the images that compose the video, based
on estimated semantic scores between an image and a music segment.
Irie et al. (2010) generate a film trailer by selecting the most
emotional film segments and reordering the resulting set of shots so
as to optimize the shot sequence, based on each shot’s emotional
impact.


3 Case Study 1: Film Tribute 

Figure 1 shows the processes involved in the production of a film
tribute, given a film and a song as input. The length of the song
imposes the duration of the final artifact, which is populated by
content selected from the film. In order to maintain some emotional
consistency concerning the selected content, we use a
centrality-based approach. For this reason, we obtain relevant
sentences from the film by summarizing its subtitles using Support
Sets. This algorithm uncovers groups of semantically-related
passages. A support set is created for each passage of the input,
determined by comparing each passage with all remaining ones from the
source. A summary is composed of the most relevant passages, which
are the ones present in the highest number of support sets.
Furthermore, we use the subtitles’ timestamps to obtain the matching
video clips. For each one, we detect the corresponding scene. Then,
we extract emotion-related audio features from the music and the
scenes of the video clips, and compare them to obtain the ones that
are more emotionally similar to the music. The scenes provide more
auditory information concerning the events involving the extracted
video clip. The final video is composed by joining the resulting
video clips with the specified music.

Figure 1: Film tribute generation. (1) content selection. (2) emotion
synchronization. (3) video composition.
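The support-set ranking used for content selection in this case study can be sketched as follows. This is a deliberately simplified illustration, not the Support Sets implementation used in the experiments: the bag-of-words cosine similarity and the fixed `threshold` that decides support-set membership are assumptions made here for brevity, and `support_set_ranking` is a hypothetical name.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def support_set_ranking(passages, threshold=0.2):
    """Rank passage indices by the number of support sets containing them."""
    bags = [Counter(p.lower().split()) for p in passages]
    counts = [0] * len(passages)
    for i, bag_i in enumerate(bags):
        # Support set of passage i: every other passage similar enough to it
        # (assumed membership rule).
        for j, bag_j in enumerate(bags):
            if i != j and cosine(bag_i, bag_j) >= threshold:
                counts[j] += 1
    # Most supported passages first; ties keep the original order.
    return sorted(range(len(passages)), key=lambda j: (-counts[j], j))
```

A summary of a given length would then take passages from the head of the returned ranking.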

In order to obtain the film’s scenes, we segment it using lav2yuv, a
program distributed with the MJPEG tools (Chen et al., 2012), with
the scene detection threshold fixed to 40. Then, we extract
emotion-related audio features from the video scenes and music: 384
features computed as statistical functionals applied to low-level
descriptor contours (the INTERSPEECH 2009 Emotion Challenge feature
set from openSMILE (Eyben et al., 2010)). The 16 low-level
descriptors are the following:

• Root-mean-square signal frame energy;

• Mel-frequency cepstral coefficients 1-12;

• Zero-crossing rate of time signal (frame-based);

• The voicing probability computed from the ACF;

• The fundamental frequency computed from the Cepstrum.
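For orientation, the size of the feature vector can be accounted for as follows. The breakdown comes from the published description of the INTERSPEECH 2009 Emotion Challenge feature set rather than from this text: each of the 16 low-level descriptors is paired with its first-order delta coefficient, and 12 statistical functionals are applied to every resulting contour.

```latex
\underbrace{(1 + 12 + 1 + 1 + 1)}_{16\ \text{low-level descriptors}}
\times \underbrace{2}_{\text{contour and its delta}}
\times \underbrace{12}_{\text{functionals}} = 384
```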

The resulting vector for each clip is then compared with the music
vector using cosine similarity. If the similarity between them is
greater than 0.7 (an empirically determined value), we consider that
the video clip has the same emotion as the music. The length of the
audio clip containing the music is filled with the resulting clips.
The final video is composed by joining the resulting video clips with
the specified music, following the chronological order of the input
film.

Figure 2: Lecture-driven science-talks video generation architecture:
θ_i represent topic mixtures.
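The clip-selection rule for film tributes (keep clips whose feature vector has a cosine similarity above 0.7 with the music vector, in the film's chronological order, until the song's length is filled) can be sketched as follows. The tuple layout, the name `select_clips`, and the rule of stopping at the first clip that would overrun the song are assumptions made for illustration; the text does not say how an overrun is handled.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_clips(clips, music_vector, song_length, threshold=0.7):
    """clips: iterable of (start_time, duration, feature_vector) tuples.

    Returns the (start_time, duration) pairs of the chosen clips in
    chronological order.
    """
    chosen, total = [], 0.0
    for start, duration, features in sorted(clips, key=lambda c: c[0]):
        if cosine_similarity(features, music_vector) <= threshold:
            continue  # not emotionally close enough to the music
        if total + duration > song_length:
            break  # assumed stopping rule: never exceed the song's length
        chosen.append((start, duration))
        total += duration
    return chosen
```

In practice each feature vector would be the 384-dimensional openSMILE output; the two-dimensional vectors one might use to try the function are only stand-ins.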

4 Case Study 2: Lecture-Driven 
Science-Talks 

Figure 2 depicts the architecture of the computational method
developed for the generation of lecture-driven science-talks. Our
method receives as input the lecture’s subtitles. First, we use
GRASSHOPPER, a diversity-based approach, to obtain the most
important, yet diverse, set of sentences from the lecture.
GRASSHOPPER is a graph-based ranking algorithm based on random walks
in an absorbing Markov chain, which focuses on maximizing diversity
while minimizing redundancy. The list of items returned by the
algorithm is computed based on how well they represent particular
groups (centrality), the difference between them (diversity), and a
prior ranking. Then, we determine, for each sentence, topic-related
content from a collection of documentaries, as a means of obtaining
thematically-coherent content. In the following sections, we detail
two major approaches carried out using Latent Dirichlet Allocation
(LDA): (i) a model trained at sentence-level; (ii) inferring the
lecture in a model trained at document-level, in order to obtain a
subset of topic-related documentaries, then training an additional
model for this subset at sentence-level. Finally, we present the
steps involved in choosing the best candidates to compose the final
artifact.
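The absorbing random walk behind GRASSHOPPER can be sketched as follows. This is a minimal pure-Python illustration of the general scheme (Zhu et al., 2007), not the implementation used in the experiments: the function name, the default `lam`, and the use of fixed-point iterations in place of a matrix inverse are choices made here. The first item is the one with the highest stationary probability; each later item is the one with the most expected visits once the items already ranked have been turned into absorbing states.

```python
def grasshopper(W, prior, lam=0.9, iters=500):
    """Rank the items of a similarity matrix W (list of lists, positive
    row sums) given a prior distribution over items."""
    n = len(W)
    # Row-normalise the similarities and mix in teleporting to the prior.
    P = []
    for row in W:
        total = sum(row)
        P.append([lam * (w / total) + (1 - lam) * r
                  for w, r in zip(row, prior)])

    # First item: highest stationary probability, by power iteration.
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    ranked = [max(range(n), key=lambda j: pi[j])]

    # Later items: ranked items become absorbing states. For the remaining
    # states, the column sums of N = (I - Q)^-1 satisfy v = 1 + Q^T v and
    # are proportional to the expected visits before absorption.
    while len(ranked) < n:
        free = [i for i in range(n) if i not in ranked]
        visits = {j: 1.0 for j in free}
        for _ in range(iters):
            visits = {j: 1.0 + sum(visits[i] * P[i][j] for i in free)
                      for j in free}
        ranked.append(max(free, key=lambda j: visits[j]))
    return ranked
```

Because walks that start near an absorbed item end quickly, items similar to those already selected receive few visits, which is what pushes the ranking toward diversity.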

4.1 LDA-Based Content Selection 


LDA (Blei et al., 2003) is a generative probabilistic model that
explains document collections based on mixtures of topics. In this
sense, LDA follows the intuitive notion that documents exhibit
multiple topics. Each topic is characterized as a distribution over
words in a fixed vocabulary, and each document is a random mixture of
topics. Given the trained model, it is possible to find related
documents by assessing the similarity of their topic mixtures.
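The idea of finding related documents by comparing topic mixtures can be sketched as follows. The mixtures are assumed to have been inferred already by a trained LDA model (training and inference are not shown), and `top_k_related` is a hypothetical helper name. Each mixture is a list of topic proportions, and candidates are ranked by cosine similarity to the query mixture.

```python
import math


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def top_k_related(query_mixture, candidate_mixtures, k=10):
    """Return the indices of the k candidates whose topic mixtures are
    most similar to the query mixture, best first."""
    scored = [(cosine_similarity(query_mixture, mixture), index)
              for index, mixture in enumerate(candidate_mixtures)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]
```

With a query such as a lecture sentence and candidates drawn from documentaries, the returned indices identify the most topic-related candidate sentences.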


4.1.1 Sentence-Level Model

We train an LDA model for the collection of documentaries using 100
topics at sentence-level, uncovering the hidden thematic structure of
the collection. Then, we situate new data, namely the summarized
lecture, into the estimated model: we fit the summarized lecture into
the learned topic structure using variational inference. Now, the
sentences of the summarized lecture and the collection of
documentaries can be compared by looking at how similar their topic
mixtures are. Given that GRASSHOPPER aims at maximizing the diversity
of the condensed lecture, we posit that each sentence of the summary
regards different topics. As a result, we compare, based on topic
mixtures, each sentence of the lecture with sentences from
documentaries, using the cosine distance, and obtain a top-k
containing topic-related candidate documentary sentences for each of
the lecture’s sentences. As a first approach, all experiments were
made with k fixed at 10.

4.1.2 2-Stage Model

In this approach, we first train an LDA model for the collection of
documentaries using 100 topics at document-level. Then, by fitting
the lecture document in the estimated model, we use its topic mixture
to obtain the most relevant topic-related documentaries. This first
step allows a focused determination of the candidate segments. After
that, the procedure presented above is applied. The model is now
trained at sentence-level, for the subset of documentaries, using a
smaller number of topics, for instance, 10. As before, the lecture is
situated into this new model and the cosine distance determines the
top-k with topic-related candidate documentary sentences.

4.2 Ranking

Given the top-k candidates, we want to choose the sentence that best
represents the group. For this reason, we use the Support Sets
algorithm to let sets of candidate sentences compete among
themselves, thus obtaining a sentence ranking. In our current work,
the final video is composed of the best ranked candidate sentences of
documentaries. However, since the same sentence can be a candidate
for more than one lecture segment, which would result in redundant
content in the final artifact, only unused candidates are chosen (if
the best ranked sentence was already selected, the next in rank is
used).


5 Results and Discussion

We conducted preliminary experiments in order to incorporate the
viewer’s feedback into the generation process. Several issues were
identified: (i) the pace at which the video progresses, the result of
concatenating video clips that correspond to sentence-level segments;
(ii) the text stream is mapped to the video using subtitles, which
occasionally causes the time interval corresponding to a subtitle
sentence not to encompass the speech that it portrays; (iii)
overlapping music with video segments can clutter the audio stream by
making the video’s speech unclear; and (iv) when joining audio/video
segments from different sources, the lack of loudness consistency in
the final audio stream affects the overall user experience. In the
following sections, we discuss our approaches for content selection
and address the aspects of coherence in light of our early results.


5.1 Content Selection 

In our work, content selection is driven by a text stream that
corresponds to transcripts of speech monologues and dialogs,
presented in the input document’s subtitles. In that sense, we do not
detect important content based on visual or audio cues, except those
corresponding to speech (via subtitles). For such cues, other
approaches can be used (Coldefy et al., 2004; Coldefy and Bouthemy,
2004). For text-based selection, different approaches are available
depending on the aspects of interest (such as diversity).

For film tributes, which target the viewer’s emotions, algorithms
that focus on the most central (important) content may be
appropriate. Apart from Support Sets, other algorithms can be used,
for instance, LexRank (Erkan and Radev, 2004), which is also a
centrality-based algorithm. In contrast with film tributes,
lecture-driven science-talks are instructive. Thus, in order to
capture sentences related to the various topics in the lecture, other
algorithms based on diversity can be used, for instance, MMR
(Carbonell and Goldstein, 1998), which provides a model that linearly
combines relevance and novelty.

Our choices were based on previous work (omitted for blind review)
that shows that Support Sets and GRASSHOPPER provide better summaries
for the content selection phase.


5.2 Coherence 

Regarding film tributes, our experiments show that viewers are able
to establish correspondence between the emotions of the film and the
accompanying music. However, to improve results, thematic coherence
can be considered: if the song has lyrics, they can also be taken
into account to relate its topics to the film.

Regarding lecture-driven science-talks, one of the drawbacks of
sentence-to-sentence substitution is the lack of continuity and the
perceived fragmentation of the final artifacts. Although topic
analysis was used to address thematic coherence, our preliminary
experiments show that it is not enough for a viewer to regard the
final video as fluid and well-composed. Additionally, considering our
two model approaches, the 2-stage model seems to be better than the
sentence-level model: since most of the content is provided by the
subset of documentaries, the final video contains segments drawn from
fewer documentary episodes. However, this problem merits further
study.

Overall, both methods use subtitles segmented at sentence-level. For
film tributes, the song is used to obtain an emotionally-coherent
multimedia artifact. Results show that segmentation at sentence-level
does not significantly affect its overall coherence. In contrast, for
informative videos, sentence organization in the final artifact is of
critical importance: although thematic coherence can be easily
identified, the displayed content progresses without causality
through the video segments that compose the artifact, adversely
affecting the viewer’s overall experience. Furthermore, video
segments obtained from the input subtitles usually have a specified
duration longer than the corresponding speech audio segment. As a
result, the video segments have abrupt transitions, sometimes with
considerable color variations. This last effect was clearly pointed
out as a limitation by viewers. We plan to address these issues in
the near future.


6 Conclusions and Future Work 


We presented methods for the generation of multimedia artifacts,
focusing on content selection and coherence aspects. We produced two
types of video: film tributes, and lecture-driven science-talks. Each
case study considers a different creative intent. While film tributes
intend to appeal to the emotive side of the viewer, lecture-driven
documentaries are expected to provide a more instructive experience.
Although our preliminary experiments received good audience feedback,
further improvements regarding aspects of coherence still need to be
made.

Regarding future work and concerning the identified issues, in the
case of film tributes, if the music has vocals, its lyrics can be
taken into account to relate their topics to the film. A possible
solution is to receive only the film as input and then choose a
topic-related song from an existing dataset, for instance, the
Million Song Dataset (Bertin-Mahieux et al., 2011). Furthermore,
considering music with vocals, audio adjustments can be made,
specifically, in moments where the music’s energy is high enough to
interfere with speech from the video. Also, the final video’s volume
can be adjusted when the music’s energy is higher than some
threshold. Finally, to improve the final artifact’s structural
coherence, we can take into account the music’s structure and align
it to the video stream (Nieto and Bello, 2014).

Still regarding coherence, we intend to identify locally-coherent
sentences in the input of each method: the summarized film, and the
summarized lecture. LSA can be used as a technique for measuring
coherence, by comparing vectors of adjacent sentences in the
generated semantic space. Adjacent sentences that are similar enough
can thus be considered as groups of locally-coherent sentences. For
films, these groups can be directly used. For lectures, we can
identify the topic mixtures that best represent each group and
replace them by choosing other groups of locally-coherent documentary
sentences.

Video composition techniques for video production can be used as a
means to build a narrative, which can be seen as a series of events
in a chain (Branigan, 1992). To resolve the identified abrupt shot
transitions, we can use the underlying audio stream to provide
continuity cues. For that, a data-driven voice-activity detector
based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN)
can be used (Eyben et al., 2013). Furthermore, content progression in
the final multimedia artifact can be established by comparing
adjacent video segments and ensuring that they are not too similar or
too different (Ahanger and Little, 1998).